The dataset describes the parameters of loan applications, specifically the applicants' attributes. Whether an applicant obtains a loan, and at what interest rate, depends on features such as home ownership, annual income, state of residence, and delinquencies on previous accounts. We build several visualizations to understand these relationships and then apply machine learning models to predict the interest rate.
pip install plotly
Requirement already satisfied: plotly in c:\users\sneha\anaconda3\lib\site-packages (5.7.0) Requirement already satisfied: tenacity>=6.2.0 in c:\users\sneha\anaconda3\lib\site-packages (from plotly) (8.0.1) Requirement already satisfied: six in c:\users\sneha\anaconda3\lib\site-packages (from plotly) (1.15.0) Note: you may need to restart the kernel to use updated packages.
import numpy as np
import pandas as pd
from pandas import DataFrame
import plotly.express as px
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import ipywidgets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('https://raw.githubusercontent.com/SnehaBaranwal/snehab2412.github.io/master/loans_full_schema.csv')  # read the csv file
df
| emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | ... | sub_grade | issue_month | loan_status | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | global config engineer | 3.0 | NJ | MORTGAGE | 90000.0 | Verified | 18.01 | NaN | NaN | NaN | ... | C3 | Mar-2018 | Current | whole | Cash | 27015.86 | 1999.33 | 984.14 | 1015.19 | 0.0 |
| 1 | warehouse office clerk | 10.0 | HI | RENT | 40000.0 | Not Verified | 5.04 | NaN | NaN | NaN | ... | C1 | Feb-2018 | Current | whole | Cash | 4651.37 | 499.12 | 348.63 | 150.49 | 0.0 |
| 2 | assembly | 3.0 | WI | RENT | 40000.0 | Source Verified | 21.15 | NaN | NaN | NaN | ... | D1 | Feb-2018 | Current | fractional | Cash | 1824.63 | 281.80 | 175.37 | 106.43 | 0.0 |
| 3 | customer service | 1.0 | PA | RENT | 30000.0 | Not Verified | 10.16 | NaN | NaN | NaN | ... | A3 | Jan-2018 | Current | whole | Cash | 18853.26 | 3312.89 | 2746.74 | 566.15 | 0.0 |
| 4 | security supervisor | 10.0 | CA | RENT | 35000.0 | Verified | 57.96 | 57000.0 | Verified | 37.66 | ... | C3 | Mar-2018 | Current | whole | Cash | 21430.15 | 2324.65 | 1569.85 | 754.80 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | owner | 10.0 | TX | RENT | 108000.0 | Source Verified | 22.28 | NaN | NaN | NaN | ... | A4 | Jan-2018 | Current | whole | Cash | 21586.34 | 2969.80 | 2413.66 | 556.14 | 0.0 |
| 9996 | director | 8.0 | PA | MORTGAGE | 121000.0 | Verified | 32.38 | NaN | NaN | NaN | ... | D3 | Feb-2018 | Current | whole | Cash | 9147.44 | 1456.31 | 852.56 | 603.75 | 0.0 |
| 9997 | toolmaker | 10.0 | CT | MORTGAGE | 67000.0 | Verified | 45.26 | 107000.0 | Source Verified | 29.57 | ... | E2 | Feb-2018 | Current | fractional | Cash | 27617.65 | 4620.80 | 2382.35 | 2238.45 | 0.0 |
| 9998 | manager | 1.0 | WI | MORTGAGE | 80000.0 | Source Verified | 11.99 | NaN | NaN | NaN | ... | A1 | Feb-2018 | Current | whole | Cash | 21518.12 | 2873.31 | 2481.88 | 391.43 | 0.0 |
| 9999 | operations analyst | 3.0 | CT | RENT | 66000.0 | Not Verified | 20.82 | NaN | NaN | NaN | ... | B4 | Feb-2018 | Current | whole | Cash | 11574.83 | 1658.56 | 1225.17 | 433.39 | 0.0 |
10000 rows × 55 columns
df.dtypes
emp_title                            object
emp_length                          float64
state                                object
homeownership                        object
annual_income                       float64
verified_income                      object
debt_to_income                      float64
annual_income_joint                 float64
verification_income_joint            object
debt_to_income_joint                float64
delinq_2y                             int64
months_since_last_delinq            float64
earliest_credit_line                  int64
inquiries_last_12m                    int64
total_credit_lines                    int64
open_credit_lines                     int64
total_credit_limit                    int64
total_credit_utilized                 int64
num_collections_last_12m              int64
num_historical_failed_to_pay          int64
months_since_90d_late               float64
current_accounts_delinq               int64
total_collection_amount_ever          int64
current_installment_accounts          int64
accounts_opened_24m                   int64
months_since_last_credit_inquiry    float64
num_satisfactory_accounts             int64
num_accounts_120d_past_due          float64
num_accounts_30d_past_due             int64
num_active_debit_accounts             int64
total_debit_limit                     int64
num_total_cc_accounts                 int64
num_open_cc_accounts                  int64
num_cc_carrying_balance               int64
num_mort_accounts                     int64
account_never_delinq_percent        float64
tax_liens                             int64
public_record_bankrupt                int64
loan_purpose                         object
application_type                     object
loan_amount                           int64
term                                  int64
interest_rate                       float64
installment                         float64
grade                                object
sub_grade                            object
issue_month                          object
loan_status                          object
initial_listing_status               object
disbursement_method                  object
balance                             float64
paid_total                          float64
paid_principal                      float64
paid_interest                       float64
paid_late_fees                      float64
dtype: object
plt.figure(figsize=(15, 10))
plt.subplot(2, 2, 1)
grade = sorted(df.grade.unique().tolist())
sns.countplot(x='grade', data=df, hue='initial_listing_status', order=grade)
plt.subplot(2, 2, 2)
sub_grade = sorted(df.sub_grade.unique().tolist())
g = sns.countplot(x='sub_grade', data=df, hue='initial_listing_status', order=sub_grade)
g.set_xticklabels(g.get_xticklabels(), rotation=90);
ls= df.groupby('loan_status').count()
ls
| emp_title | emp_length | state | homeownership | annual_income | verified_income | debt_to_income | annual_income_joint | verification_income_joint | debt_to_income_joint | ... | grade | sub_grade | issue_month | initial_listing_status | disbursement_method | balance | paid_total | paid_principal | paid_interest | paid_late_fees | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| loan_status | |||||||||||||||||||||
| Charged Off | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 1 | 1 | 1 | ... | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| Current | 8585 | 8599 | 9375 | 9375 | 9375 | 9375 | 9354 | 1399 | 1362 | 1399 | ... | 9375 | 9375 | 9375 | 9375 | 9375 | 9375 | 9375 | 9375 | 9375 | 9375 |
| Fully Paid | 419 | 420 | 447 | 447 | 447 | 447 | 445 | 59 | 57 | 59 | ... | 447 | 447 | 447 | 447 | 447 | 447 | 447 | 447 | 447 | 447 |
| In Grace Period | 62 | 63 | 67 | 67 | 67 | 67 | 66 | 17 | 16 | 17 | ... | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 |
| Late (16-30 days) | 36 | 36 | 38 | 38 | 38 | 38 | 38 | 7 | 7 | 7 | ... | 38 | 38 | 38 | 38 | 38 | 38 | 38 | 38 | 38 | 38 |
| Late (31-120 days) | 59 | 59 | 66 | 66 | 66 | 66 | 66 | 12 | 12 | 12 | ... | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 |
6 rows × 54 columns
ls_title= df['loan_status'].unique()
ls_title
array(['Current', 'Fully Paid', 'In Grace Period', 'Late (31-120 days)',
'Charged Off', 'Late (16-30 days)'], dtype=object)
s = df['loan_status'].value_counts()  # counts the occurrences of each unique loan_status; returns a Series
new = pd.DataFrame({'loan_status':s.index, 'Count':s.values})
s.values
array([9375, 447, 67, 66, 38, 7], dtype=int64)
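px.pie derives the percentage for each slice from these raw counts. As a quick sanity check, the shares can also be computed directly; a minimal sketch, with the counts copied from the value_counts output above:

```python
import pandas as pd

# Counts copied from the value_counts output above
s = pd.Series(
    [9375, 447, 67, 66, 38, 7],
    index=['Current', 'Fully Paid', 'In Grace Period',
           'Late (31-120 days)', 'Late (16-30 days)', 'Charged Off'],
)
shares = (s / s.sum() * 100).round(2)  # percentage of loans per status
print(shares)
```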
fig = px.pie(values=s.values, names=s.index, title='Percentage of loan_status per category', color=s.index, color_discrete_sequence=px.colors.qualitative.G10)
fig.show()
states= df['state'].unique()
states
array(['NJ', 'HI', 'WI', 'PA', 'CA', 'KY', 'MI', 'AZ', 'NV', 'IL', 'FL',
'SC', 'CO', 'TN', 'TX', 'VA', 'NY', 'GA', 'MO', 'AR', 'MD', 'NC',
'NE', 'WV', 'NH', 'UT', 'DE', 'MA', 'OR', 'OH', 'OK', 'SD', 'MN',
'AL', 'WY', 'LA', 'IN', 'KS', 'MS', 'WA', 'ME', 'VT', 'CT', 'NM',
'AK', 'MT', 'RI', 'ND', 'DC', 'ID'], dtype=object)
amount=df.groupby('state')['loan_amount'].sum()
amount
state
AK      632850
AL     1946550
AR      979775
AZ     4113475
CA    22073700
CO     4020525
CT     3258925
DC      394500
DE      300600
FL    11203675
GA     5545300
HI      674500
ID      636950
IL     6776600
IN     2692325
KS     1405500
KY     1509475
LA     1631175
MA     3801875
MD     3902375
ME      347275
MI     4036700
MN     2769525
MO     2651825
MS     1106850
MT      386200
NC     4809225
ND      258500
NE      758825
NH      722225
NJ     5665850
NM      695075
NV     2638750
NY    12589650
OH     5434125
OK     1448350
OR     2129450
PA     4665825
RI      751375
SC     2613900
SD      308700
TN     2651000
TX    13755275
UT      938700
VA     4465225
VT      336775
WA     3860700
WI     2028875
WV      981650
WY      312175
Name: loan_amount, dtype: int64
# Bar chart of total loan amount per state, built from the grouped totals above
amount_df = amount.reset_index()  # columns: state, loan_amount
fig = px.bar(amount_df, x='state', y='loan_amount', color='state',
             labels={'state': 'State', 'loan_amount': 'Loan Amount'},
             title='Loan Amount for every State'
)
fig.show()
plt.figure(figsize=(15, 20))
plt.subplot(4, 2, 1)
sns.countplot(x='term', data=df, hue='application_type', palette='RdPu', edgecolor='black')
plt.subplot(4, 2, 2)
sns.countplot(x='homeownership', data=df, hue='application_type', palette='RdPu', edgecolor='black')
plt.subplot(4, 2, 3)
sns.countplot(x='verified_income', data=df, hue='application_type', palette='RdPu',edgecolor='black')
plt.subplot(4, 2, 4)
g = sns.countplot(x='loan_purpose', data=df, hue='application_type', palette='RdPu',edgecolor='black')
g.set_xticklabels(g.get_xticklabels(), rotation=90);
paidtotal=df.groupby('loan_purpose')['paid_total'].sum()
paidtotal
loan_purpose
car                   1.880338e+05
credit_card           4.999259e+06
debt_consolidation    1.372605e+07
home_improvement      1.766449e+06
house                 4.151339e+05
major_purchase        8.362259e+05
medical               4.216902e+05
moving                1.275343e+05
other                 1.998309e+06
renewable_energy      1.468471e+04
small_business        3.676864e+05
vacation              8.128827e+04
Name: paid_total, dtype: float64
paidprincipal=df.groupby('loan_purpose')['paid_principal'].sum()
paidprincipal
loan_purpose
car                    136720.19
credit_card           3791547.82
debt_consolidation   10254801.85
home_improvement      1370570.81
house                  330449.70
major_purchase         673944.02
medical                362076.41
moving                 103924.02
other                 1572277.12
renewable_energy         9332.35
small_business         271651.94
vacation                67188.43
Name: paid_principal, dtype: float64
paidint=df.groupby('loan_purpose')['paid_interest'].sum()
paidint
loan_purpose
car                    51313.63
credit_card           1207487.60
debt_consolidation    3470732.04
home_improvement       395801.79
house                   84631.22
major_purchase         162251.86
medical                 59553.64
moving                  23610.29
other                  425829.13
renewable_energy         5352.36
small_business          96004.41
vacation                14099.84
Name: paid_interest, dtype: float64
paidlate=df.groupby('loan_purpose')['paid_late_fees'].sum()
paidlate
loan_purpose
car                     0.00
credit_card           223.73
debt_consolidation    518.93
home_improvement       76.62
house                  52.98
major_purchase         30.00
medical                60.12
moving                  0.00
other                 202.78
renewable_energy        0.00
small_business         30.00
vacation                0.00
Name: paid_late_fees, dtype: float64
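The four groupby cells above can also be collapsed into a single aggregation by selecting all four paid_* columns at once. A minimal sketch on a toy frame (the values are hypothetical; only the column names match the dataset):

```python
import pandas as pd

# Toy stand-in for df with the same column names (values are hypothetical)
toy = pd.DataFrame({
    'loan_purpose':   ['car', 'car', 'medical'],
    'paid_total':     [100.0, 200.0, 50.0],
    'paid_principal': [80.0, 150.0, 40.0],
    'paid_interest':  [20.0, 50.0, 10.0],
    'paid_late_fees': [0.0, 0.0, 0.0],
})

cols = ['paid_total', 'paid_principal', 'paid_interest', 'paid_late_fees']
summary = toy.groupby('loan_purpose')[cols].sum()  # one frame, one pass
print(summary)
```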
import plotly.graph_objects as go
from plotly.subplots import make_subplots
fig = make_subplots(rows=2, cols=2, start_cell="bottom-left",subplot_titles=('Paid Interest',
'Paid Late','Paid Total','Paid Principal'))
fig.add_trace(go.Scatter(y=df['loan_purpose'].unique(), x=paidtotal, mode='markers'),
row=1, col=1)
fig.add_trace(go.Scatter(y=df['loan_purpose'].unique(), x=paidprincipal, mode='markers'),
row=1, col=2)
fig.add_trace(go.Scatter(y=df['loan_purpose'].unique(), x=paidint,mode='markers'),
row=2, col=1)
fig.add_trace(go.Scatter(y=df['loan_purpose'].unique(), x=paidlate,mode='markers'),
row=2, col=2)
fig.show()
for col in df.columns:
print(col)
emp_title
emp_length
state
homeownership
annual_income
verified_income
debt_to_income
annual_income_joint
verification_income_joint
debt_to_income_joint
delinq_2y
months_since_last_delinq
earliest_credit_line
inquiries_last_12m
total_credit_lines
open_credit_lines
total_credit_limit
total_credit_utilized
num_collections_last_12m
num_historical_failed_to_pay
months_since_90d_late
current_accounts_delinq
total_collection_amount_ever
current_installment_accounts
accounts_opened_24m
months_since_last_credit_inquiry
num_satisfactory_accounts
num_accounts_120d_past_due
num_accounts_30d_past_due
num_active_debit_accounts
total_debit_limit
num_total_cc_accounts
num_open_cc_accounts
num_cc_carrying_balance
num_mort_accounts
account_never_delinq_percent
tax_liens
public_record_bankrupt
loan_purpose
application_type
loan_amount
term
interest_rate
installment
grade
sub_grade
issue_month
loan_status
initial_listing_status
disbursement_method
balance
paid_total
paid_principal
paid_interest
paid_late_fees
#Create input and output features for the models
output_features = df[["interest_rate"]]
print(output_features)
      interest_rate
0             14.07
1             12.61
2             17.09
3              6.72
4             14.07
...             ...
9995           7.35
9996          19.03
9997          23.88
9998           5.32
9999          10.91

[10000 rows x 1 columns]
#Selection of input features that can affect the interest rate
input_features = df[["loan_amount","term","grade","loan_purpose"]]
print(input_features)
      loan_amount  term grade        loan_purpose
0           28000    60     C              moving
1            5000    36     C  debt_consolidation
2            2000    36     D               other
3           21600    36     A  debt_consolidation
4           23000    36     C         credit_card
...           ...   ...   ...                 ...
9995        24000    36     A               other
9996        10000    36     D  debt_consolidation
9997        30000    36     E  debt_consolidation
9998        24000    36     A               other
9999        12800    36     B         credit_card

[10000 rows x 4 columns]
#Identified categorical values
input_features["loan_purpose"].unique()
array(['moving', 'debt_consolidation', 'other', 'credit_card',
'home_improvement', 'medical', 'house', 'small_business', 'car',
'major_purchase', 'vacation', 'renewable_energy'], dtype=object)
input_features["grade"].unique()
array(['C', 'D', 'A', 'B', 'F', 'E', 'G'], dtype=object)
output_features["interest_rate"].unique()
array([14.07, 12.61, 17.09, 6.72, 13.59, 11.99, 6.71, 15.04, 9.92,
9.43, 19.03, 28.72, 26.77, 15.05, 6.08, 11.98, 7.96, 7.34,
5.32, 6.07, 12.62, 9.44, 20.39, 9.93, 21.45, 10.42, 18.06,
22.91, 30.79, 17.47, 5.31, 7.97, 14.08, 19.42, 10.91, 16.02,
13.58, 16.01, 20. , 21.85, 10.9 , 23.87, 7.35, 23.88, 25.82,
10.41, 18.45, 30.17, 24.85, 25.81, 24.84, 30.75, 29.69, 26.3 ,
22.9 , 6. , 30.65, 30.94])
#Applied get_dummies() method and then dropped the old columns
dummies = pd.get_dummies(input_features.grade)
merged = pd.concat([input_features, dummies], axis='columns')
merged
| loan_amount | term | grade | loan_purpose | A | B | C | D | E | F | G | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28000 | 60 | C | moving | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 5000 | 36 | C | debt_consolidation | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2000 | 36 | D | other | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 21600 | 36 | A | debt_consolidation | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 23000 | 36 | C | credit_card | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 24000 | 36 | A | other | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9996 | 10000 | 36 | D | debt_consolidation | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9997 | 30000 | 36 | E | debt_consolidation | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 9998 | 24000 | 36 | A | other | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9999 | 12800 | 36 | B | credit_card | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
10000 rows × 11 columns
dummies2 = pd.get_dummies(merged.loan_purpose)
merged2 = pd.concat([merged, dummies2], axis='columns')
merged2
| loan_amount | term | grade | loan_purpose | A | B | C | D | E | F | ... | debt_consolidation | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28000 | 60 | C | moving | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 5000 | 36 | C | debt_consolidation | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2000 | 36 | D | other | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 21600 | 36 | A | debt_consolidation | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 23000 | 36 | C | credit_card | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 24000 | 36 | A | other | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9996 | 10000 | 36 | D | debt_consolidation | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9997 | 30000 | 36 | E | debt_consolidation | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9998 | 24000 | 36 | A | other | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9999 | 12800 | 36 | B | credit_card | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10000 rows × 23 columns
input_features_final = merged2.drop(['loan_purpose', 'grade'], axis='columns')
input_features_final
| loan_amount | term | A | B | C | D | E | F | G | car | ... | debt_consolidation | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28000 | 60 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 5000 | 36 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2000 | 36 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 21600 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 23000 | 36 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 24000 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9996 | 10000 | 36 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9997 | 30000 | 36 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9998 | 24000 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9999 | 12800 | 36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10000 rows × 21 columns
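The get_dummies/concat/drop sequence above can also be done in one call by passing `columns=` directly; note that this variant prefixes the dummy columns (e.g. `grade_C`) instead of using the bare category names. A minimal sketch on a toy frame (hypothetical values, same column names as the dataset):

```python
import pandas as pd

# Toy stand-in for input_features (values are hypothetical)
toy = pd.DataFrame({
    'loan_amount':  [28000, 5000, 2000],
    'term':         [60, 36, 36],
    'grade':        ['C', 'C', 'D'],
    'loan_purpose': ['moving', 'debt_consolidation', 'other'],
})

# Encode both categoricals and drop the originals in a single step
encoded = pd.get_dummies(toy, columns=['grade', 'loan_purpose'])
print(encoded.columns.tolist())
```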
input_features_final.isnull().values.any()
False
# #Convert the output features from floats to ints as the classifier expects categorical data
# output_features['interest_rate'] = output_features['interest_rate'].astype(int)
#Split the dataset into train and test inputs for the model
X_train, X_test, y_train, y_test = train_test_split(input_features_final,output_features,test_size = 0.3)
#Checking the consistency of the data sets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(7000, 21) (3000, 21) (7000, 1) (3000, 1)
# Bin the continuous interest rates into integer classes, since the classifier expects discrete labels
model_y_train = y_train["interest_rate"].astype(int)
model_y_test = y_test["interest_rate"].astype(int)
X_train
| loan_amount | term | A | B | C | D | E | F | G | car | ... | debt_consolidation | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3157 | 10000 | 36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 8260 | 7900 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4717 | 5000 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1087 | 21000 | 60 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6293 | 12000 | 36 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2881 | 4000 | 36 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 449 | 20000 | 60 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6118 | 16000 | 60 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5550 | 9000 | 36 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5463 | 25000 | 36 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7000 rows × 21 columns
y_train
| interest_rate | |
|---|---|
| 3157 | 9.43 |
| 8260 | 6.71 |
| 4717 | 6.72 |
| 1087 | 26.77 |
| 6293 | 12.61 |
| ... | ... |
| 2881 | 15.05 |
| 449 | 7.96 |
| 6118 | 16.02 |
| 5550 | 9.43 |
| 5463 | 6.72 |
7000 rows × 1 columns
model_y_train
3157 9
8260 6
4717 6
1087 26
6293 12
..
2881 15
449 7
6118 16
5550 9
5463 6
Name: interest_rate, Length: 7000, dtype: int32
#Instantiate and train the model (Random Forest Classifier)
model = RandomForestClassifier(bootstrap=False, max_depth=90, max_features='sqrt',
min_samples_split=5, n_estimators=800, random_state=0)
model.fit(X_train, model_y_train)
RandomForestClassifier(bootstrap=False, max_depth=90, max_features='sqrt',
min_samples_split=5, n_estimators=800, random_state=0)
predict = model.predict(X_test)
predict
array([13, 21, 6, ..., 5, 12, 5])
answer = np.round(predict, 2)
print(answer)
[13 21 6 ... 5 12 5]
mean_squared_error(model_y_test, answer) # We can see the mean squared error is a bit high
2.472
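Part of this error comes from treating a continuous target as discrete classes. Since interest_rate is continuous, a RandomForestRegressor could fit it directly without binning. A minimal sketch on synthetic data (not the notebook's dataset; variable names and data are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))             # three synthetic features
y = 5 + 20 * X[:, 0] + rng.normal(0, 0.5, 200)   # continuous rate-like target

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X[:150], y[:150])                        # train on the first 150 rows

mse = mean_squared_error(y[150:], reg.predict(X[150:]))
print(mse)  # regression MSE on the held-out 50 rows
```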
y_test= y_test.to_numpy()
y_test = [item for sublist in y_test for item in sublist]
plt.figure(figsize=(10,10))
plt.scatter(y_test, answer, c='crimson')
p1 = max(max(answer), max(y_test))
p2 = min(min(answer), min(y_test))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.axis('equal')
plt.show()
#Linear Regression
model2 = LinearRegression()
model2.fit(X_train, model_y_train)
LinearRegression()
max(y_test)
30.79
predict = model2.predict(X_test)
print(predict)
[13.84319321 18.67746091 6.3426567 ... 6.19649795 13.94977843 6.18251406]
answer = np.round(predict, 2)
print(answer)
[13.84 18.68 6.34 ... 6.2 13.95 6.18]
y_test
[13.59, 17.47, 7.97, 7.97, 13.59, 28.72, 11.99, 7.35, 12.62, 18.45, 9.93, 13.59, 21.45, 10.41, 12.61, 21.85, 9.43, 13.59, 13.58, 12.61, 21.45, 10.41, 6.08, 13.59, 10.91, 18.45, 6.72, 9.44, 7.35, 10.91, 16.01, 20.0, 19.42, 10.42, 7.97, 13.59, 11.99, 5.32, 9.44, 13.59, 7.97, 19.03, 15.04, 7.97, 9.92, 15.04, 13.58, 11.99, 10.41, 14.08, 21.45, 9.93, 15.05, 16.01, 12.62, 24.85, 6.72, 9.43, 9.92, 7.97, 10.41, 10.41, 18.06, 13.58, 9.44, 9.92, 7.96, 10.9, 17.09, 5.31, 15.04, 7.35, 9.44, 6.07, 7.35, 12.62, 9.93, 13.58, 12.62, 6.08, 12.61, 9.93, 6.72, 7.34, 6.72, 7.97, 6.72, 7.96, 18.06, 20.0, 20.39, 29.69, 10.42, 14.07, 24.85, 9.44, 18.45, 17.09, 18.45, 14.07, 17.09, 5.32, 25.82, 12.62, 22.91, 17.09, 9.92, 13.59, 7.96, 10.42, 15.05, 5.31, 6.08, 14.07, 14.07, 12.61, 15.04, 15.04, 14.08, 9.92, 20.0, 19.03, 24.84, 24.85, 10.91, 5.32, 9.43, 10.42, 9.92, 13.59, 7.34, 11.99, 10.42, 7.34, 13.59, 14.08, 5.32, 6.08, 26.77, 12.62, 14.08, 23.88, 7.35, 16.02, 7.96, 17.09, 17.47, 11.99, 7.97, 17.09, 11.99, 9.93, 13.59, 10.91, 12.62, 7.34, 9.92, 7.96, 7.34, 7.96, 19.42, 11.99, 7.34, 7.34, 18.06, 17.09, 9.93, 18.06, 20.39, 7.97, 11.98, 7.96, 5.32, 11.98, 14.08, 6.07, 9.44, 10.42, 5.32, 13.58, 7.34, 10.41, 9.43, 7.35, 16.02, 5.32, 6.07, 7.34, 5.32, 6.08, 16.02, 6.08, 25.82, 12.62, 14.08, 7.35, 14.07, 11.99, 25.81, 17.09, 20.39, 6.72, 13.59, 13.59, 11.98, 11.99, 24.85, 10.42, 12.62, 9.44, 5.32, 16.02, 5.31, 5.31, 9.92, 15.04, 12.62, 15.05, 10.91, 11.99, 15.04, 10.42, 6.07, 19.42, 11.98, 13.58, 10.42, 11.99, 19.42, 21.85, 10.42, 10.9, 16.02, 9.44, 9.92, 11.99, 20.0, 24.84, 24.85, 9.93, 9.44, 10.91, 6.08, 9.92, 17.09, 13.59, 26.3, 16.02, 15.04, 5.32, 15.05, 9.92, 10.41, 7.35, 13.58, 20.0, 24.84, 7.97, 10.42, 19.03, 6.08, 15.04, 6.08, 5.31, 5.32, 7.96, 7.35, 11.98, 9.43, 14.08, 9.92, 10.42, 5.32, 14.08, 26.77, 9.44, 15.05, 12.61, 7.97, 19.42, 16.01, 13.58, 11.98, 7.34, 6.08, 7.96, 11.98, 18.45, 10.42, 16.01, 9.93, 5.31, 21.45, 9.44, 7.96, 16.02, 13.58, 10.9, 10.9, 13.58, 7.34, 12.62, 6.08, 
5.31, 7.97, 12.61, 12.61, 10.9, 7.96, 7.35, 10.9, 9.93, 21.45, 15.05, 9.93, 6.08, 9.44, 20.0, 12.61, 6.72, 11.98, 9.92, 19.42, 15.04, 5.31, 20.39, 10.91, 16.01, 15.05, 6.71, 14.08, 6.08, 7.97, 5.32, 20.39, 6.72, 6.08, 16.01, 6.08, 11.99, 16.02, 12.61, 19.03, 18.45, 30.17, 7.35, 9.93, 5.32, 12.61, 10.41, 6.07, 9.43, 9.43, 20.0, 11.99, 22.9, 7.97, 10.42, 9.93, 20.39, 13.59, 7.35, 7.97, 14.07, 21.85, 11.98, 10.42, 17.47, 7.35, 6.72, 9.92, 7.97, 6.72, 18.06, 7.96, 9.43, 16.02, 5.32, 11.98, 9.43, 12.61, 15.04, 18.45, 14.08, 7.35, 12.62, 13.58, 9.93, 11.98, 10.91, 6.72, 26.3, 11.98, 6.72, 6.72, 11.99, 9.93, 17.09, 12.61, 7.97, 20.0, 23.88, 6.08, 6.71, 21.85, 26.77, 9.43, 17.09, 18.45, 12.61, 15.05, 7.35, 10.91, 12.61, 7.97, 13.59, 13.59, 10.9, 12.61, 7.35, 12.61, 19.03, 6.71, 19.03, 18.06, 24.85, 10.9, 14.07, 9.44, 15.04, 10.9, 19.03, 10.42, 7.35, 6.08, 24.85, 15.04, 16.02, 10.9, 10.9, 5.32, 9.92, 11.99, 12.62, 7.97, 10.9, 12.62, 9.44, 6.72, 21.85, 5.31, 5.31, 6.71, 7.97, 13.58, 9.93, 9.93, 14.08, 11.99, 7.96, 20.39, 9.44, 7.34, 12.61, 15.04, 22.91, 9.43, 10.42, 6.08, 9.43, 7.34, 17.09, 16.01, 11.99, 12.62, 18.45, 9.93, 9.44, 13.59, 13.58, 17.09, 10.42, 20.0, 13.58, 15.04, 10.9, 10.91, 13.58, 16.02, 13.59, 7.35, 10.9, 6.72, 17.09, 5.32, 15.04, 16.02, 7.96, 20.0, 7.97, 6.71, 19.03, 6.08, 9.93, 9.44, 6.08, 7.97, 7.35, 5.32, 17.09, 6.72, 10.9, 10.9, 15.05, 26.3, 10.41, 15.05, 7.34, 20.0, 10.91, 11.99, 17.09, 16.01, 9.93, 20.39, 10.91, 15.04, 9.93, 10.91, 7.96, 26.3, 12.62, 11.98, 18.06, 14.07, 15.04, 12.61, 10.41, 7.34, 14.07, 16.01, 10.42, 10.9, 10.41, 13.59, 15.05, 17.09, 6.07, 9.93, 11.99, 15.05, 12.62, 15.04, 18.45, 6.71, 14.08, 6.72, 11.99, 9.93, 6.71, 15.05, 12.62, 9.44, 9.44, 7.34, 7.34, 20.39, 13.59, 16.02, 9.44, 5.32, 6.08, 9.93, 12.61, 9.43, 9.43, 12.62, 9.93, 14.07, 14.08, 11.98, 12.62, 14.08, 9.43, 9.93, 9.43, 12.62, 11.98, 13.58, 9.93, 16.02, 11.99, 16.01, 16.02, 14.08, 19.03, 6.72, 6.72, 9.93, 9.43, 7.35, 10.42, 16.02, 10.91, 6.07, 11.98, 7.96, 26.3, 6.72, 
21.45, 13.58, 12.62, 6.71, 10.9, 11.99, 7.97, 7.97, 9.93, 9.44, 10.9, 11.99, 9.92, 9.93, 9.43, 5.32, 9.92, 6.08, 9.43, 15.05, 12.61, 9.92, 5.32, 10.91, 11.98, 13.59, 23.88, 7.96, 14.08, 6.07, 9.92, 13.58, 14.08, 13.59, 11.99, 6.08, 11.99, 17.47, 14.07, 14.07, 9.44, 7.96, 16.01, 21.45, 7.34, 9.44, 11.99, 7.34, 7.35, 16.01, 13.58, 6.08, 5.31, 11.99, 16.02, 11.98, 11.99, 20.39, 16.01, 10.91, 6.72, 10.9, 9.92, 11.98, 15.05, 18.06, 7.35, 7.96, 10.42, 7.35, 11.98, 11.98, 12.61, 17.09, 16.01, 12.62, 25.82, 10.42, 13.58, 9.92, 6.07, 7.35, 14.08, 9.93, 7.35, 11.99, 17.09, 15.05, 7.34, 6.07, 7.34, 6.72, 10.9, 10.9, 13.59, 16.02, 7.34, 17.47, 10.91, 11.98, 20.39, 13.59, 17.47, 7.35, 11.99, 9.43, 10.42, 22.91, 15.04, 9.93, 6.72, 6.07, 20.0, 10.42, 6.71, 6.07, 9.93, 16.01, 13.59, 6.07, 15.04, 13.59, 10.41, 7.35, 19.03, 7.96, 7.35, 9.93, 25.82, 7.97, 12.61, 5.32, 14.08, 9.44, 15.04, 11.98, 20.0, 5.31, 6.08, 12.61, 6.72, 11.98, 12.61, 10.91, 18.06, 9.43, 9.92, 19.03, 12.61, 7.96, 16.02, 9.92, 6.71, 20.0, 13.59, 13.58, 20.0, 9.93, 15.04, 26.3, 16.02, 5.31, 12.62, 15.04, 6.08, 7.34, 9.93, 9.44, 13.58, 6.07, 11.99, 7.97, 7.96, 10.42, 18.06, 14.07, 5.32, 10.42, 10.41, 15.04, 20.0, 9.92, 15.05, 7.35, 12.61, 7.35, 6.71, 11.98, 12.62, 6.07, 9.44, 12.62, 12.61, 9.43, 10.9, 9.44, 6.72, 9.93, 5.31, 16.01, 16.01, 9.43, 12.62, 9.44, 21.45, 7.97, 7.35, 5.32, 7.96, 6.07, 6.08, 11.99, 20.39, 20.39, 6.08, 7.35, 20.39, 12.62, 6.71, 13.59, 10.91, 13.58, 9.93, 15.04, 6.71, 6.72, 16.02, 5.31, 12.62, 14.08, 9.43, 10.9, 15.04, 11.99, 7.96, 9.93, 11.99, 16.01, 15.04, 11.98, 24.84, 12.61, 19.03, 14.07, 20.39, 16.02, 19.03, 22.91, 20.39, 6.72, 12.61, 13.59, 22.91, 5.32, 12.62, 7.97, 7.35, 10.9, 13.58, 11.99, 16.02, 10.42, 10.42, 9.92, 13.58, 10.91, 9.43, 9.43, 6.72, 16.02, 12.61, 10.9, 21.45, 7.96, 9.93, 19.03, 20.39, 10.41, 18.45, 9.92, 18.06, 14.08, 11.98, 23.88, 6.08, 10.41, 13.59, 13.59, 7.96, 7.35, 12.61, 7.35, 9.93, 18.06, 5.31, 20.0, 19.42, 16.02, 6.72, 7.96, 9.92, 13.59, 14.08, 19.42, 9.44, 9.44, 
16.01, 9.44, 15.05, 9.44, 5.31, 14.07, 13.59, 16.01, 17.47, 11.99, 14.08, 10.91, 10.9, 7.96, 7.97, 9.44, 21.45, 7.96, 7.97, 21.85, 12.61, 9.93, 14.08, 19.42, 16.02, 9.44, 6.72, 12.62, 12.61, 10.42, 10.42, 6.72, 9.93, 10.9, 19.42, 21.45, 14.07, 15.05, 18.06, 22.91, 12.62, 10.9, 10.91, 7.97, 16.01, 30.79, 21.45, 5.31, 6.08, 18.06, 7.96, 6.07, 16.02, 9.92, 6.72, 16.02, 23.88, 14.08, 16.01, 6.07, 16.01, 12.62, 11.99, 5.31, 5.32, 14.08, 11.99, 10.9, 6.08, 11.99, 25.82, 9.93, 10.42, 10.42, 17.47, 11.98, 30.17, 7.97, 14.07, 7.34, 10.9, 13.59, 10.9, 7.34, 15.04, ...]
mean_squared_error(y_test,answer)
1.4872128000000004
Here we can conclude that linear regression works better than the random forest classifier, as its mean squared error (about 1.49) is lower than the classifier's (about 2.47). Note that the comparison is only approximate: the classifier's error is measured against the integer-binned rates, while the regression's is measured against the original continuous rates.
plt.figure(figsize=(10,10))
plt.scatter(y_test, answer, c='crimson')
p1 = max(max(answer), max(y_test))
p2 = min(min(answer), min(y_test))
plt.plot([p1, p2], [p1, p2], 'b-')
plt.xlabel('True Values', fontsize=15)
plt.ylabel('Predictions', fontsize=15)
plt.axis('equal')
plt.show()
I could have used hyperparameter tuning, which searches for the optimal parameters of the machine learning models; this typically improves predictive performance at the cost of additional computation time.
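As a sketch of what that tuning could look like, GridSearchCV cross-validates every combination in a parameter grid and keeps the best one. The grid and synthetic data below are illustrative, not the notebook's actual dataset or search space:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, 120)

# Small illustrative grid; a real search would cover more values
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)  # parameter combination with the best CV score
```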